
Conversation

@damian0815 commented Oct 17, 2025

Port cc-mrjob/sitemaps_from_robotstxt.py.

  • Basic functionality unit tests
  • warcio implementation
    • Validate output is identical to MRJob output with "test" robotstxt in MRJob repo
    • Validate on recent full-scale crawl output
  • fastwarc implementation
  • unit test to validate text encoding edge cases and validity (currently all test cases are completely valid utf8)
  • check output works with crawl-tools/server/seed/sitemaps/sitemaps_robotstxt.py
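
For context, the heart of the ported job is scanning each robots.txt capture for Sitemap: directives and resolving them against the robots.txt URL. A minimal sketch of that step, not the PR's actual implementation (the function name and return shape are illustrative only):

from urllib.parse import urljoin

def extract_sitemap_urls(robots_txt_url, robots_txt_body):
    """Yield absolute sitemap URLs announced in a robots.txt body."""
    for line in robots_txt_body.splitlines():
        # a directive looks like "Sitemap: <URL>", matched case-insensitively
        name, _, value = line.partition(':')
        if name.strip().lower() == 'sitemap':
            # relative sitemap URLs are resolved against the robots.txt location
            yield urljoin(robots_txt_url, value.strip())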

@damian0815 marked this pull request as draft October 17, 2025 15:27
@damian0815 marked this pull request as ready for review October 20, 2025 14:24
Signed-off-by: Damian Stewart <[email protected]>

@sebastian-nagel left a comment


First pass. I'll continue with testing, but several points need discussion.

Signed-off-by: Damian Stewart <[email protected]>

@sebastian-nagel left a comment


Great! A few minor things remain to be done.

Tested on my local machine:

  • successfully ran the unit tests using both the pyspark module and an installed Spark. For the latter, it's required to set PYTHONPATH=$PWD/test:$(ls $SPARK_HOME/python/lib/py4j-*-src.zip):$SPARK_HOME/python:$PYTHONPATH
  • successfully tested sitemaps_from_robotstxt.py. Output looks good. On a small test sample, there are no differences in the number of extracted sitemap URLs compared with the cc-mrjob implementation.
  • failed to run sitemaps_from_robotstxt_fastwarc.py

requirements.txt Outdated
orjson
warcio

# for validating URLs in robots.txt:
Contributor


Not required anymore.

Contributor Author


Fixed

from typing import Optional
from urllib.parse import urlparse, urljoin

import validators
Contributor


Not required anymore.

Contributor Author


Fixed

def test_host_accumulation_same_host(spark):
    """
    Test accumulation of hosts when sitemap url host and robots.txt url host match
    Requires test/ on PYTHONPATH so utils._process_jobs can be imported
Contributor


A 4-line function could be inlined to avoid the import error.

However, test/ is also required on the path to load test_sitemaps_from_robotstxt, which is needed for Spark serialization. So you need to run it like

PYTHONPATH=$PYTHONPATH:./test python -m pytest test -v

Contributor Author


README has been updated with this information

robots_txt_with_more_than_50_sitemaps = None


def init_accumulators(self, session):
Contributor


log_accumulators also needs to be overridden, otherwise the class-specific accumulators are never shown and are not preserved once the job has finished. See cc_index_word_count.py or wat_extract_links.py.
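
For reference, a rough sketch of the override pattern used by other cc-pyspark jobs such as wat_extract_links.py. The counters mirror those in the job's log output; apart from robots_txt_with_more_than_50_sitemaps, the attribute names and method bodies here are assumptions, not the PR's code:

def init_accumulators(self, session):
    super().init_accumulators(session)
    sc = session.sparkContext
    self.robots_txt_processed = sc.accumulator(0)
    self.sitemap_urls_found = sc.accumulator(0)
    self.robots_txt_with_more_than_50_sitemaps = sc.accumulator(0)

def log_accumulators(self, session):
    super().log_accumulators(session)
    # without this override, only the base class's generic counters are logged
    self.log_accumulator(session, self.robots_txt_processed,
                         'robots.txt successfully parsed = {}')
    self.log_accumulator(session, self.sitemap_urls_found,
                         'sitemap urls found = {}')
    self.log_accumulator(session, self.robots_txt_with_more_than_50_sitemaps,
                         'robots.txt with more than 50 sitemaps = {}')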

Contributor Author


done

# process only WARC response records
fastwarc_record_filter = WarcRecordType.response

# process_record is implemented by SitemapExtractorJob
Contributor


The "main" block is required in order to run the job.
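
A sketch of the usual cc-pyspark entry point, assuming the FastWARC job class is named SitemapExtractorFastWarcJob:

if __name__ == '__main__':
    # instantiate the job and hand control to the shared job runner
    job = SitemapExtractorFastWarcJob()
    job.run()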

Contributor


Please also run the job to verify that there are no errors.

Contributor Author


fixed


if robots_txt_url is None:
    # first sitemap found: set base URL and get host from URL
    robots_txt_url = record.rec_headers['WARC-Target-URI']
Contributor


This is not compatible with FastWARC; it should be self.get_warc_header(record, 'WARC-Target-URI').
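
The helper hides the difference between the two WARC libraries. A rough sketch of how the base classes might implement it (class names as in cc-pyspark's sparkcc.py; the exact method bodies are assumptions):

# warcio-based jobs (CCSparkJob): headers live on record.rec_headers
def get_warc_header(self, record, header, default=None):
    return record.rec_headers.get_header(header, default)

# FastWARC-based jobs (CCFastWarcSparkJob): headers live on record.headers
def get_warc_header(self, record, header, default=None):
    return record.headers.get(header, default)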

Contributor Author


fixed


@sebastian-nagel left a comment


Excellent, @damian0815!

Unit tests pass; I successfully ran both versions (warcio and fastwarc) locally.

I'll also run the job on a real cluster later today and merge the PR if this test passes as well.

Would you mind squashing the commits down to a small and meaningful number? I can also do it when merging. Thanks!

@damian0815

I think it's easier/cleaner if you squash when merging?


@sebastian-nagel left a comment


I've successfully run sitemaps_from_robotstxt_fastwarc.py on 5% of the robots.txt captures of CC-MAIN-2025-43 using a single-node Hadoop cluster (Spark on YARN):

13:57:14.391 [Thread-5] INFO  SitemapExtractorFastWarc - WARC/WAT/WET input files processed = 5000
13:57:14.394 [Thread-5] INFO  SitemapExtractorFastWarc - WARC/WAT/WET input files failed = 0
13:57:14.396 [Thread-5] INFO  SitemapExtractorFastWarc - WARC/WAT/WET records processed = 4444146
13:57:14.398 [Thread-5] INFO  SitemapExtractorFastWarc - robots.txt successfully parsed = 4444146
13:57:14.401 [Thread-5] INFO  SitemapExtractorFastWarc - sitemap urls found = 3949561
13:57:14.403 [Thread-5] INFO  SitemapExtractorFastWarc - sitemap urls with invalid utf-8 encoding = 76
13:57:14.405 [Thread-5] INFO  SitemapExtractorFastWarc - robots.txt announcing at least 1 sitemap = 1885795
13:57:14.408 [Thread-5] INFO  SitemapExtractorFastWarc - robots.txt with more than 50 sitemaps = 3547

I'll merge the code. Thanks, @damian0815!

@sebastian-nagel merged commit cc70f85 into main Nov 1, 2025
4 checks passed